library("here")
library("edgeR")
library("tibble")
library("dplyr")
library("ggplot2")
library("DESeq2")
library("ade4")
library("kableExtra")
library("SummarizedExperiment")
library("pasilla")
library("pheatmap")
library("factoextra")
library("MASS")
library("pheatmap")
library("lattice")
library("ggcorrplot")
library("ggfortify")
library("ggrepel")
library("pracma")
library("ggridges")
High-throughput sequencing assays that provide quantitative readouts in the form of count data include;
-RNA-Seq:RNA molecules found in a population of cells or in a tissue
-ChIP-Seq: DNA regions that are bound to particular DNA-binding proteins (selected by immuno-precipitation)
-RIP-Seq:RNA molecules or regions of them bound to a particular RNA-binding protein
-DNA-Seq: genomic DNA with prevalence of genetic variants in heterogeneous populations of cells
-HiC: map the 3D spatial arrangement of DNA
-genetic screens: proliferation or survival of cells upon gene knockdown
-microbiome: abundance of different microbial species in complex microbial habitats.
Analyzing such data requires elaborate statistical techniques and considerations. This chapter addresses on such techniques with a special focus on RNA-seq.
Basic RNA-seq Workflow
The main aim is to detect and quantify systematic changes between samples from different conditions.
Statstical concepts and tools that will be employed include; - multifactorial designs, linear models and analysis of variance
generalized linear models
robustness and outlier detection
shrinkage estimation
data transformations that make data amenable to unsupervised methods
A sequencing library is the collection of DNA molecules used as input for the sequencing machine.
Fragments are the molecules being sequenced. Since the currently most widely used technology can only deal with molecules of length around 300–1000 nucleotides, these are obtained by fragmenting the (generally longer) DNA or cDNA molecules of interest.
A read is the sequence obtained from a fragment.
We bigin with gene count data from an experiment on effect of RNAi knockdown of Pasilla (Brooks et al. (2011)), the Drosophila melanogaster ortholog of mammalian NOVA1 and NOVA2, on the transcriptome.
fn = system.file("extdata", "pasilla_gene_counts.tsv",
package = "pasilla", mustWork = TRUE)
counts = as.matrix(read.csv(fn, sep = "\t", row.names = "gene_id"))
dim(counts)
## [1] 14599 7
counts[ 2000+(0:3), ]
## untreated1 untreated2 untreated3 untreated4 treated1 treated2
## FBgn0020369 3387 4295 1315 1853 4884 2133
## FBgn0020370 3186 4305 1824 2094 3525 1973
## FBgn0020371 1 0 1 1 1 0
## FBgn0020372 38 84 29 28 63 28
## treated3
## FBgn0020369 2165
## FBgn0020370 2120
## FBgn0020371 0
## FBgn0020372 27
-Heteroskedasticity
-Non-symmetric distribution
-Systematic sampling biases
-Sources of stochastic experimental variation
Designed RNA-seq experiments are usually limited in in replicates used and hence lack enough power. Hence distributional assumptions that let us compute the probabilities of rare events in the tails of the distribution are made.
These are approched differently than a DGE analysis.
Isoforms: Gene isoforms are mRNAs that are produced from the same locus but are different in their transcription start sites (TSSs), protein coding DNA sequences (CDSs) and/or untranslated regions (UTRs), potentially altering gene function. The analysis of DAST is challenging because the isoform origin for only a small fraction of the sequenced reads can be determined in a typical RNA-seq dataset Hu et al. (2018).
Analyses of such gene structures are termed, Differential alternative splicing or transcription (DAST). This is done by tools such as:
PennDiff
Cuffdiff
SplicingCompass
rSeqDiff
DiffSplice
Gene expression variability across technical replicates, has been shown to approximately follow a Poisson distribution, for which the variance is equal to the mean.
Thus the probability that a given read maps to the \(i^{th}\) gene is \(p_i = n_i/n\), and that this is pretty much independent of the outcomes for all the other reads. This can be modelled as a poison distribution with rate; \(λ_i=rp_i\) where \(r\) is the number of reads.
However, biological replication introduces additional cross-sample variability, and analysis frameworks therefore have resorted to the usage of the gamma-Poisson or the negative binomial (NB) distribution which has an additional dispersion parameter and a quadratic mean-variance relationship.
The observed counts of the features cannot be directly compared across samples since there are differences in sequencing depth across libraries; hence the need for scaling/normalizing the counts.
► Question
For the example dataset count of Section 8.3, how does the output of DESeq2’s estimateSizeFactorsForMatrix compare to what you get by simply taking the column sums?
ggplot(tibble(
`size factor` = estimateSizeFactorsForMatrix(counts),
`sum` = colSums(counts)), aes(x = `size factor`, y = `sum`)) +
geom_point()
► Task
Locate the R sources for this book and have a look at the code that produces Figure 8.1.
szfcDemo = data.frame(
x = c(2, 4, 6, 6, 8) * 10,
y = c(3, 6, 2, 9, 12) * 10,
name = LETTERS[1:5],
check.names = FALSE)
slopes = c(
blue = with(szfcDemo, sum(y) / sum(x)),
red = szfcDemo[, c("x", "y")] %>% as.matrix %>%
(DESeq2::estimateSizeFactorsForMatrix) %>% (function(x) x[2]/x[1]) %>% as.vector)
ggplot(szfcDemo, aes(x = x, y = y, label = name)) + geom_point() +
coord_fixed() + xlim(c(0, 128)) + ylim(c(0, 128)) + xlab("sample 1") + ylab("sample 2") +
geom_text(hjust= 0.5, vjust = -0.6) +
geom_abline(slope = slopes[1], col = names(slopes)[1]) +
geom_abline(slope = slopes[2], col = names(slopes)[2])
► Question
Plot the mean-variance relationship for the biological replicates in the pasilla dataset.
library("ggplot2")
library("matrixStats")
sf = estimateSizeFactorsForMatrix(counts)
ncounts = counts / matrix(sf,
byrow = TRUE, ncol = ncol(counts), nrow = nrow(counts))
uncounts = ncounts[, grep("^untreated", colnames(ncounts)),
drop = FALSE]
ggplot(tibble(
mean = rowMeans(uncounts),
var = rowVars( uncounts)),
aes(x = log(mean), y = log(var))) +
geom_hex() + coord_fixed() + theme(legend.position = "none") +
geom_abline(slope = 1:2, color = c("forestgreen", "red"))
## Warning: Removed 2713 rows containing non-finite values (stat_binhex).
The green line is expected when mean equals the mean. The red line (slope 2) corresponds to the quadratic mean-variance relationship \(v=m^2\)
Load the pasilla data.
annotationFile = system.file("extdata",
"pasilla_sample_annotation.csv",
package = "pasilla", mustWork = TRUE)
pasillaSampleAnno = readr::read_csv(annotationFile)
## Parsed with column specification:
## cols(
## file = col_character(),
## condition = col_character(),
## type = col_character(),
## `number of lanes` = col_double(),
## `total number of reads` = col_character(),
## `exon counts` = col_double()
## )
pasillaSampleAnno
## # A tibble: 7 x 6
## file condition type `number of lane… `total number of r… `exon counts`
## <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 treated… treated single-… 5 35158667 15679615
## 2 treated… treated paired-… 2 12242535 (x2) 15620018
## 3 treated… treated paired-… 2 12443664 (x2) 12733865
## 4 untreat… untreated single-… 2 17812866 14924838
## 5 untreat… untreated single-… 6 34284521 20764558
## 6 untreat… untreated paired-… 2 10542625 (x2) 10283129
## 7 untreat… untreated paired-… 2 12214974 (x2) 11653031
Replace type column with underscores and and convert the type and condition columns into factors, explicitly specifying the prefered order of the levels.
pasillaSampleAnno = mutate(pasillaSampleAnno,
condition = factor(condition, levels = c("untreated", "treated")),
type = factor(sub("-.*", "", type), levels = c("single", "paired")))
with(pasillaSampleAnno,
table(condition, type))
## type
## condition single paired
## untreated 2 2
## treated 1 2
Deseq2 uese DESeqDataSet to store the datasets and related metadata.
Create a DESeqDataSet from the count data matrix counts and the sample annotation dataframe pasillaSampleAnno.
mt = match(colnames(counts), sub("fb$", "", pasillaSampleAnno$file))
stopifnot(!any(is.na(mt)))
pasilla = DESeqDataSetFromMatrix(
countData = counts,
colData = pasillaSampleAnno[mt, ],
design = ~ condition)
class(pasilla)
## [1] "DESeqDataSet"
## attr(,"package")
## [1] "DESeq2"
► Question
How can we access the row metadata of a SummarizedExperiment object, i.e., how can we read it out, how can we change it?
colData(pasilla)
## DataFrame with 7 rows and 6 columns
## file condition type number of lanes
## <character> <factor> <factor> <numeric>
## untreated1 untreated1fb untreated single 2
## untreated2 untreated2fb untreated single 6
## untreated3 untreated3fb untreated paired 2
## untreated4 untreated4fb untreated paired 2
## treated1 treated1fb treated single 5
## treated2 treated2fb treated paired 2
## treated3 treated3fb treated paired 2
## total number of reads exon counts
## <character> <numeric>
## untreated1 17812866 14924838
## untreated2 34284521 20764558
## untreated3 10542625 (x2) 10283129
## untreated4 12214974 (x2) 11653031
## treated1 35158667 15679615
## treated2 12242535 (x2) 15620018
## treated3 12443664 (x2) 12733865
Runing the deseq function does three things;
-estimateSizeFactors
-estimateDispersions
-nbinomWaldTest
pasilla = DESeq(pasilla)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
#View the results
res = results(pasilla)
nrow(res)
## [1] 14599
res[order(res$padj), ] %>% head
## log2 fold change (MLE): condition treated vs untreated
## Wald test p-value: condition treated vs untreated
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE
## <numeric> <numeric> <numeric>
## FBgn0039155 730.595806139728 -4.61901333325316 0.168706777691263
## FBgn0025111 1501.41051323996 2.89986434281464 0.126920536499104
## FBgn0029167 3706.11653071978 -2.19700018176299 0.0969888426793355
## FBgn0003360 4343.03539692487 -3.17967219633152 0.143526399548808
## FBgn0035085 638.232608936723 -2.56041189472443 0.137295200724247
## FBgn0039827 261.916235943103 -4.16251579404328 0.232588799612089
## stat pvalue padj
## <numeric> <numeric> <numeric>
## FBgn0039155 -27.3789434927507 4.88598917049146e-165 4.06660878660004e-161
## FBgn0025111 22.8478733450288 1.53429589608622e-115 6.38497237156281e-112
## FBgn0029167 -22.6520919424383 1.33042414468249e-113 3.69104005206412e-110
## FBgn0003360 -22.1539187656569 9.56283102522569e-109 1.98978606557384e-105
## FBgn0035085 -18.6489540873823 1.28771779124823e-77 2.14353503531181e-74
## FBgn0039827 -17.8964584751523 1.25663334671151e-71 1.74315989077998e-68
FOur exploraoty plots can b generated; can be e
#Histogram of p-values
ggplot(as(res, "data.frame"), aes(x = pvalue)) +
geom_histogram(binwidth = 0.01, fill = "Royalblue", boundary = 0)
## Warning: Removed 2241 rows containing non-finite values (stat_bin).
The dispersion can be observed as
plotDispEsts(pasilla)
► Question
If the histogram for your data is indicative of batch effects, what can you do?
Ue sva or RUVSeq to correct for the batch effects especially if they are not known.
plotMA(pasilla, ylim = c( -2, 2))
#PCA plot of the log transformed data
pas_rlog = rlogTransformation(pasilla)
plotPCA(pas_rlog, intgroup=c("condition", "type")) + coord_fixed()
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
#Heatmap
library("pheatmap")
select = order(rowMeans(assay(pas_rlog)), decreasing = TRUE)[1:30]
pheatmap( assay(pas_rlog)[select, ],
scale = "row",
annotation_col = as.data.frame(
colData(pas_rlog)[, c("condition", "type")] ))
By default, pheatmap arranges the rows and columns of the matrix by the dendrogram from (unsupervised) clustering.
#Explort results
write.csv(as.data.frame(res), file = "treated_vs_untreated.csv")
Underlying the default normalization and the dispersion estimation in DESeq2 (and many other differential expression methods) is that most genes are not differentially expressed.
What to do if assumption is not correct?
- Solution: Don't apply operations on all genes
- identify subset of -ve control genes which we belive the assumption holds
- because prior knowledge or controlled abundance as external "spike in" (PhiX) as used in MACQ2 experiment.
NB: For the normalization, although not for the dispersion estimation, one can slightly relax this assumption: it is still valid if many genes are changing, but in a way that is balanced between up- and downward directions.
► Task
Run the DESeq2 workflow with size factors and dispersion parameters estimated only from a predefined subset of genes.
pasilla1 = DESeqDataSetFromMatrix(
countData = counts,
colData = pasillaSampleAnno[mt, ],
design = ~ condition)
class(pasilla1)
## [1] "DESeqDataSet"
## attr(,"package")
## [1] "DESeq2"
p= estimateSizeFactors(pasilla1, controlGenes= (1:130))
p1=estimateDispersions(p)
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
p1=nbinomWaldTest(p1)
res1=results(p1)
nrow(res1)
## [1] 14599
res[order(res1$padj), ] %>% head
## log2 fold change (MLE): condition treated vs untreated
## Wald test p-value: condition treated vs untreated
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE
## <numeric> <numeric> <numeric>
## FBgn0039155 730.595806139728 -4.61901333325316 0.168706777691263
## FBgn0029167 3706.11653071978 -2.19700018176299 0.0969888426793355
## FBgn0025111 1501.41051323996 2.89986434281464 0.126920536499104
## FBgn0003360 4343.03539692487 -3.17967219633152 0.143526399548808
## FBgn0035085 638.232608936723 -2.56041189472443 0.137295200724247
## FBgn0039827 261.916235943103 -4.16251579404328 0.232588799612089
## stat pvalue padj
## <numeric> <numeric> <numeric>
## FBgn0039155 -27.3789434927507 4.88598917049146e-165 4.06660878660004e-161
## FBgn0029167 -22.6520919424383 1.33042414468249e-113 3.69104005206412e-110
## FBgn0025111 22.8478733450288 1.53429589608622e-115 6.38497237156281e-112
## FBgn0003360 -22.1539187656569 9.56283102522569e-109 1.98978606557384e-105
## FBgn0035085 -18.6489540873823 1.28771779124823e-77 2.14353503531181e-74
## FBgn0039827 -17.8964584751523 1.25663334671151e-71 1.74315989077998e-68
There are differences in the p values
These refer to experiments where more than one condition are assessed. this will require an extension of the linear model to accommodate all factors.
This refers to experiments with more than one factor influencing the counts.
► Question
Plot the graph of the function proposed by Huber (1964) for M-estimators.
rho = function(x, s)
ifelse(abs(x) < s, x^2 / 2, s * abs(x) - s^2 / 2)
df = tibble(
x = seq(-7, 7, length.out = 100),
parabola = x ^ 2 / 2,
Huber = rho(x, s = 2))
ggplot(reshape2::melt(df, id.vars = "x"),
aes(x = x, y = value, col = variable)) + geom_line()
Including the type and conditions into the design, a two factor formula, a two factor analysis with DESeq2 can be executed.
pasillaTwoFactor = pasilla
design(pasillaTwoFactor) = formula(~ type + condition)
pasillaTwoFactor = DESeq(pasillaTwoFactor)
## using pre-existing size factors
## estimating dispersions
## found already estimated dispersions, replacing these
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
#View the results
res2 = results(pasillaTwoFactor)
head(res2, n = 3)
## log2 fold change (MLE): condition treated vs untreated
## Wald test p-value: condition treated vs untreated
## DataFrame with 3 rows and 6 columns
## baseMean log2FoldChange lfcSE
## <numeric> <numeric> <numeric>
## FBgn0000003 0.171568715207063 0.674551784677737 3.87109094992796
## FBgn0000008 95.1440789963134 -0.0406731386938223 0.222214994845543
## FBgn0000014 1.05657219346166 -0.0849879584016901 2.11182144008079
## stat pvalue padj
## <numeric> <numeric> <numeric>
## FBgn0000003 0.174253664768659 0.86166611221445 NA
## FBgn0000008 -0.183035077007712 0.854770496036173 0.951974876389847
## FBgn0000014 -0.0402439130452425 0.967898668421209 NA
Comparing the P-values from both the two factor and one factor analysis;
trsf = function(x) ifelse(is.na(x), 0, (-log10(x)) ^ (1/6))
ggplot(tibble(pOne = res$pvalue,
pTwo = res2$pvalue),
aes(x = trsf(pOne), y = trsf(pTwo))) +
geom_hex(bins = 75) + coord_fixed() +
xlab("Single factor analysis (condition)") +
ylab("Two factor analysis (type + condition)") +
geom_abline(col = "orange")
The p-values in the two-factor analysis are similar to those from the one-factor analysis, but are generally smaller.
compareRes = table(
`simple analysis` = res$padj < 0.1,
`two factor` = res2$padj < 0.1 )
addmargins( compareRes )
## two factor
## simple analysis FALSE TRUE Sum
## FALSE 6973 289 7262
## TRUE 25 1036 1061
## Sum 6998 1325 8323
The two-factor analysis found 1325 genes differentially expressed at an FDR threshold of 10%, while the one-factor analysis found 1061.
Without modeling the blocking factor, the variability in the data that is due to it has to be absorbed by the \(ε\)s. This means that they are generally larger than in the model with the blocking factor.
Not taking into account a blocking factor can also lead to the detection of more genes
Transformations are useful for downstream analysis and data exploration such as clustering.
Common transformation approaches are;
logarithm transformation; \(y=log2(n+1)\) or more generally, \(y=log2(n+n_0)\) where n represents the count values and \(n_0\) is a somehow chosen positive constant.
Variance-stabilizing transformation
vsp = varianceStabilizingTransformation(pasilla)
-Regularized logarithm (rlog) transformation
Comparing vst and log2 transformation
j = 1
ggplot(tibble(
x = assay(pasilla)[, j],
VST = assay(vsp)[, j],
log2 = log2(assay(pasilla)[, j])) %>%
reshape2::melt(id.vars = "x"),
aes(x = x, y = value, col = variable)) +
geom_line() + xlim(c(0, 600)) + ylim(c(0, 9)) +
xlab("counts") + ylab("transformed")
## Warning: Removed 8234 rows containing missing values (geom_path).
It is conceptually distinct from variance stabilization in that it builds upon the shrinkage estimation.
It works by transforming the original count data to a \(log_2\)-like scale by fitting a “trivial” model with a separate term for each sample and a prior distribution on the coefficients which is estimated from the data.
► Question
Plot mean against standard deviation between replicates for the shifted logarithm ((8.17)), the regularized log transformation and the variance-stabilizing transformation.
library("vsn")
rlp = rlogTransformation(pasilla)
msd = function(x)
meanSdPlot(x, plot = FALSE)$gg + ylim(c(0, 1)) +
theme(legend.position = "none")
gridExtra::grid.arrange(
msd(log2(counts(pasilla, normalized = TRUE) + 1)) +
ylab("sd(log2)"),
msd(assay(vsp)) + ylab("sd(vst)"),
msd(assay(rlp)) + ylab("sd(rlog)"),
ncol = 3
)
## Warning: Removed 463 rows containing non-finite values (stat_binhex).
## Warning: Removed 14 rows containing missing values (geom_hex).
## Warning: Removed 21 rows containing non-finite values (stat_binhex).
## Warning: Removed 9 rows containing missing values (geom_hex).
## Warning: Removed 2 rows containing non-finite values (stat_binhex).
## Warning: Removed 20 rows containing missing values (geom_hex).
Outliers are extremely large counts They arise through;
-rare technical or experimental artifacts
-read mapping problems in the case of genetically differing samples,
-genuine, but rare biological events.
In Deseq2;
-The results function automatically flags genes which contain a Cook’s distance above a cutoff for samples which have 3 or more replicates. The p values and adjusted p values for these genes are set to NA.
When there are 7 or more replicates for a given sample, the DESeq function will automatically replace counts with large Cook’s distance with the trimmed mean over all samples, scaled up by the size factor or normalization factor for that sample.
This outlier replacement only occurs when there are 7 or more replicates, and can be turned off with DESeq(dds, minReplicatesForReplace=Inf)
par(mar=c(8,5,2,2))
boxplot(log10(assays(pasillaTwoFactor)[["cooks"]]), range=0, las=2)
To detect effects that have a strong enough size, as opposed to ones that are statistically significant, the results function allow for threshold-based Wald tests: lfcThreshold, which takes a numeric of a non-negative threshold value, and altHypothesis which specifies the kind of test.
par(mfrow = c(4, 1), mar = c(2, 2, 1, 1))
myMA = function(h, v, theta = 0.5) {
plotMA(pasilla, lfcThreshold = theta, altHypothesis = h,
ylim = c(-2.5, 2.5))
abline(h = v * theta, col = "dodgerblue", lwd = 2)
}
myMA("greaterAbs", c(-1, 1))
myMA("lessAbs", c(-1, 1))
myMA("greater", 1)
myMA("less", -1 )
pasillaEdge1 <- DGEList(counts=counts, sample = pasillaSampleAnno[mt,], group=pasillaSampleAnno$condition)
design <- model.matrix(~ condition, pasillaEdge1$samples)
class(pasillaEdge1)
## [1] "DGEList"
## attr(,"package")
## [1] "edgeR"
y <- calcNormFactors(pasillaEdge1, method="TMM")
plotMDS(pasillaEdge1)
y1 <- estimateDisp(y, design, robust=TRUE)
y1$common.dispersion
## [1] 0.0228685
plotBCV(y1)
#Generalized Linear Models with Quasi-likelihood Tests
glmfit <- glmQLFit(y1, design, robust=TRUE)
plotQLDisp(glmfit)
results <- glmQLFTest(glmfit, coef = 2)
topTags(results)
## Coefficient: conditiontreated
## logFC logCPM F PValue FDR
## FBgn0039155 -4.601317 5.882317 958.0935 3.023628e-17 4.414194e-13
## FBgn0025111 2.905756 6.923428 703.5673 4.791924e-16 3.497865e-12
## FBgn0003360 -3.173036 8.452776 462.5921 1.973292e-14 9.602698e-11
## FBgn0035085 -2.548257 5.684922 437.2499 3.239556e-14 1.182357e-10
## FBgn0039827 -4.142255 4.397963 419.5859 4.653398e-14 1.358699e-10
## FBgn0034736 -3.492036 4.186934 385.9148 9.683601e-14 2.072631e-10
## FBgn0029167 -2.188103 8.221274 384.7723 9.937952e-14 2.072631e-10
## FBgn0029896 -2.434452 5.305827 333.8934 3.421243e-13 6.243341e-10
## FBgn0000071 2.685868 4.795202 300.7758 8.457310e-13 1.371870e-09
## FBgn0034434 -3.624878 3.214994 264.6460 2.545131e-12 3.715636e-09
is.de <- decideTests(results, p.value=0.05)
summary(is.de)
## conditiontreated
## Down 327
## NotSig 13919
## Up 353
plotMD(results)
abline(h=c(-1, 1), col="blue")
##the glmLRT approach##
#Conduct likelihood ratio tests for 1 vs 2 and show the top genes:
#First fit genewise glms:
fit1 <- glmFit(y1, design)
lrt <- glmLRT(fit1, coef = 2)
topTags(lrt)
## Coefficient: conditiontreated
## logFC logCPM LR PValue FDR
## FBgn0039155 -4.604477 5.882317 686.2611 2.906395e-151 4.243046e-147
## FBgn0025111 2.906356 6.923428 498.6996 1.823501e-110 1.331065e-106
## FBgn0003360 -3.173027 8.452776 352.3142 1.327984e-78 6.462413e-75
## FBgn0039827 -4.143209 4.397963 327.8901 2.769069e-73 1.010641e-69
## FBgn0029167 -2.187794 8.221274 307.5157 7.592933e-69 2.216985e-65
## FBgn0035085 -2.550032 5.684922 303.0060 7.292541e-68 1.774397e-64
## FBgn0034736 -3.494263 4.186934 262.8782 4.047158e-59 8.440638e-56
## FBgn0029896 -2.432680 5.305827 230.5551 4.511023e-52 8.232053e-49
## FBgn0000071 2.686094 4.795202 221.8658 3.543406e-50 5.747798e-47
## FBgn0034434 -3.619774 3.214994 188.8292 5.726708e-43 8.360422e-40
plotMD(lrt)
abline(h=c(-1, 1), col="blue")
#The total number of differentially expressed genes at 5% FDR is given by:
summary(decideTests(lrt))
## conditiontreated
## Down 362
## NotSig 13830
## Up 407
#Compare the Pvalues
trsf = function(x) ifelse(is.na(x), 0, (-log10(x)) ^ (1/6))
ggplot(tibble(pOne = res$pvalue,
QLF = results$table$PValue),
aes(x = trsf(pOne), y = trsf(QLF))) +
geom_hex(bins = 75) + coord_fixed() +
xlab("DEseq2") +
ylab("EdgeRQLF") +
geom_abline(col = "orange")
ggplot(tibble(pOne = res$pvalue,
Lrt = lrt$table$PValue),
aes(x = trsf(pOne), y = trsf(Lrt))) +
geom_hex(bins = 75) + coord_fixed() +
xlab("DEseq2") +
ylab("glmLRT") +
geom_abline(col = "orange")
ggplot(tibble(QLF = results$table$PValue,
Lrt = lrt$table$PValue),
aes(x = trsf(Lrt), y = trsf(QLF))) +
geom_hex(bins = 75) + coord_fixed() +
xlab("glmLRT") +
ylab("QLF") +
geom_abline(col = "orange")
Brooks, Angela N, Li Yang, Michael O Duff, Kasper D Hansen, Jung W Park, Sandrine Dudoit, Steven E Brenner, and Brenton R Graveley. 2011. “Conservation of an Rna Regulatory Map Between Drosophila and Mammals.” Genome Research 21 (2). Cold Spring Harbor Lab: 193–202.
Hu, Yu, Jennie Lin, Jian Hu, Gang Hu, Kui Wang, Hanrui Zhang, Muredach P Reilly, and Mingyao Li. 2018. “PennDiff: Detecting Differential Alternative Splicing and Transcription by Rna Sequencing.” Bioinformatics 34 (14). Oxford University Press: 2384–91.